3 research outputs found
How Universal is Genre in Universal Dependencies?
This work provides the first in-depth analysis of genre in Universal
Dependencies (UD). In contrast to prior work on genre identification which uses
small sets of well-defined labels in mono-/bilingual setups, UD contains 18
genres with varying degrees of specificity spread across 114 languages. As most
treebanks are labeled with multiple genres while lacking annotations about
which instances belong to which genre, we propose four methods for predicting
instance-level genre using weak supervision from treebank metadata. The
proposed methods recover instance-level genre better than competitive baselines
as measured on a subset of UD with labeled instances and adhere better to the
global expected distribution. Our analysis sheds light on prior work using UD
genre metadata for treebank selection, finding that metadata alone are a noisy
signal and must be disentangled within treebanks before it can be universally
applied.Comment: Accepted at SyntaxFest 202